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Abstract 

In this work we develop a new algorithm for regularized empirical risk minimization. Our 
method extends recent techniques of Shalev-Shwartz [02/2015], which enable a dual-free analysis 
of SDCA, to arbitrary mini-batching schemes. Moreover, our method is able to better utilize the 
information in the data defining the ERM problem. For convex loss functions, our complexity 
results match those of QUARTZ, which is a primal-dual method also allowing for arbitrary 
mini-batching schemes. The advantage of a dual-free analysis comes from the fact that it 
guarantees convergence even for non-convex loss functions, as long as the average loss is convex. 
We illustrate through experiments the utility of being able to design arbitrary mini-batching 
schemes. 


1 Introduction 


Empirical risk minimization (ERM) is a very successful and immensely popular paradigm in ma¬ 
chine learning, used to train a variety of prediction and classification models. Given examples 
Ai,, An G loss functions (j)i,... ,(j)n '■ IR™' —t M and a regularization parameter A > 0, the 

L2-regularized ERM problem is an optimization problem of the form 


min 


P{w) 


1 . .T N 

:^2^MaJw) + -\\w 


( 1 ) 


Throughout the paper we shall assume that for each i, the loss function (pi is l^-smooth with 
li > 0. That is, for all x,y G and all i G [n] := {1, 2,..., n}, we have 


\\V(pi{x) - V(pi{y)\\ < li\\x - y\\. 


( 2 ) 


Further, let Li,..., > 0 be constants for which the inequality 

\\Vpi{Ajw) - V(pi{Aj 2:)|| < Li\\w - zjj (3) 

holds for all w,z and all i and let L := max* Lj. Note that we can always bound Li < /j||Aj||. 
However, Lj can be better (smaller) than ^j||Aj||. 
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1.1 Background 

In the last few years, a lot of research effort was put into designing new efficient algorithms for 
solving this problem (and some of its modifications). The frenzy of activity was motivated by the 
realization that SGD [T], not so long ago considered the state-of-the-art method for ERM, was far 
from being optimal, and that new ideas can lead to algorithms which are far superior to SGD in 
both theory and practice. The methods that belong to this category include SAG [2], SDGA [3], 
SVRG m, S2GD [5], mS2GD [6], SAGA [7], S2GD [8], QUARTZ [9], ASDCA [TOj, prox-SDCA 
[II], IPROX-SDCA [I2], A-PROX-SDGA [I3|, AdaSDCA [H], SDNA [I5|. Methods analyzed for 
arbitrary mini-batching schemes include NSync [l6|, ALPHA [Uj and QUARTZ |9]. 

In order to find an e-solution in expectation, state of the art (non-accelerated) methods for 
solving Q only need 

0((n -h k) log(l/e)) 

steps, where each step involves the computation of the gradient \7(f)i{Ajw) for some randomly 
selected example i. The quantity k is the condition number. Typically one has k = 

for methods picking i uniformly at random, and k = ^ for methods picking i using a 

carefully designed data-dependent importance sampling. Computation of such a gradient typically 
involves work which is equivalent to reading the example Ai, that is, 0{nnz{Ai)) < dm arithmetic 
operations. 


1.2 Contributions 


In this work we develop a new algorithm for the L2-regularized ERM problem 0 . Our method 
extends a technique recently introduced by Shalev-Shwartz |18) . which enables a dual-free analysis 
of SDGA, to arbitrary mini-hatching schemes. That is, our method works at each iteration with 
a random subset of examples, chosen in an i.i.d. fashion from an arbitrary distribution. Such 
flexible schemes are useful for various reasons, including i) the development of distributed or robust 
variants of the method, ii) design of importance sampling for improving the complexity rate, iii) 
design of a sampling which is aimed at obtaining efficiencies elsewhere, such us utilizing NUMA 
(non-uniform memory access) architectures, and iv) streamlining and speeding up the processing 
of each mini-batch by means of assigning to each processor approximately even workload so as to 
reduce idle time (we do experiments with the latter setup). 

In comparison with [l8| , our method is able to better utilize the information in the data examples 
Ai,... ,An, leading to a better data-dependent bound. Eor convex loss functions, our complexity 
results match those of QUARTZ [9| in terms of the rate (the logarithmic factors differ). QUARTZ 
is a primal-dual method also allowing for arbitrary mini-batching schemes. However, while |9| 
only characterize the decay of expected risk, we also give bounds for the sequence of iterates. In 
particular, we show that for convex loss functions, our method enjoys the rate (Theorem]^ 


max 


IjVi 

Pi XpiU 


log 


'(L +A)E(o)\ 


Ae 


I ? 


where pi is the probability that coordinate i is updated in an iteration, vi,... ,Vn > 0 are certain 
“stepsize” parameters of the method associated with the sampling and data (see Q), and 
is a constant depending on the starting point. Eor instance, in the special case picking a single 
example at a time uniformly at random, we have pi = 1/n and Vi = ||Aj|p, whereby we obtain one 
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of the 0{n + k) log(l/e) rates mentioned above. The other rate can be recovered using importance 
sampling. 

The advantage of a dual-free analysis comes from the fact that it guarantees convergence even for 
non-convex loss functions, as long as the average loss is convex. This is a step toward understanding 
non-convex models. In particular, we show that for non-convex loss functions, our method enjoys 
the rate (Theorem]^ 


max 


1 

-h 

Pi 




where is a constant depending on the starting point. 

Finally, we illustrate through experiments with “chunking”—a simple load balancing technique— 
the utility of being able to design arbitrary mini-batching schemes. 


2 Algorithm 

We shall now describe the method (Algorithm [^. 

Algorithm 1 dfSDCA: Dual-Free SDCA with Arbitrary Sampling 
Parameters: Sampling S', stepsize 0 

Initialization ..., an^ G M™, set ^ Pi = Prob(i G S) 

for t > 1 do 

Sample a set St according to S 

for i £ St do 

- 9p~^{V(j)i{Aj 

yj{t) ^ 


The method encodes a family of algorithms, depending on the choice of the sampling S, which 
encodes a particular mini-batching scheme. Formally, a sampling S is a set-valued random variable 
with values being the subsets of [n], i.e., subsets of examples. In this paper, we use the terms 
“mini-batching scheme” and “sampling” interchangeably. A sampling is defined by the collection 
of probabilities Prob(S') assigned to every subset S C [n] of the examples. 

The method maintains n vectors at G M™' and a vector w G At the beginning of step t, we 
have af for all i and computed and stored in memory. We then pick a random subset St 

of the examples, according to the mini-batching scheme, and update variables at for i G St, based 
on the computation of the gradients V(j)i{Aj for i G St- This is followed by an update of 
the vector w, which is performed so as to maintain the relation 





( 4 ) 


This relation is maintained for the following reason. If w* is 


the optimal solution to 0 , then 


1 

0 = VP(w*) = - V AiV(j)i(Ajw*) + Xw*, 
n 

i=l 


( 5 ) 
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and hence w* = ^ Sr=i > where a* := —'V(j)i{Ajw*). So, if we believe that the variables a* 
converge to —V(/>i(Ajw*), it indeed does make sense to maintain Q. Why should we believe this? 
This is where the specific update of the “dual variables” Oi comes from: a* is set a convex combina¬ 
tion of its previous value and our best estimate so far of —V(/)i(Ajw*), namely, —V(f>i(Ajw^*~^^). 
Indeed, the update can be written as 

aP = (1 - + 0p~\-V4)i(Ajw‘^*-^^)). 

Why does this make sense? Because we believe that converges to w*. Admittedly, this 

reasoning is somewhat “circular”. However, a better word to describe this reasoning would be: 
“iterative”. 


3 Main Results 


Let Pi := P(i G S). We assume the knowledge of parameters ui,..., > 0 for which 


E 


Y^Aihi 

ieS 


2-1 

n 

< '^PiViWKW^ 

i=l 


( 6 ) 


Tight and easily computable formulas for such parameters can be found in m For instance, 
whenever Prob(|,S| < r) = 1, inequality Q holds with Vi = r||Aj|p. 

To simplify the exposure, we will write 


drf 


-w* 


(t) ttef II (t) 




— a,- 


i = 1,2, 


,n. 


( 7 ) 


3.1 Non-convex loss functions 

Our result will be expressed in terms of the decay of the potential ^ 

where and are defined in Q. 

Theorem 1. Assume that the average loss funetion, ^ is convex. If Q holds and we let 


6 < min 
i 


PiuX^ 

Lfvi + nA2 ’ 


( 8 ) 


then the for t >0 the potential decays exponentially to zero as 


E 




(9) 


Moreover, if we set 9 equal to the upper bound in Q, then 

r>„.ax(i + |-)lojL±k^) ^ 

* \pi MpiuJ y Xe J 


B[P{w^^i) - P{w*)] < 


4 













3.2 Convex loss functions 


Our result will be expressed in terms of the decay of the potential ^ ^ 

where Bf^ and are defined in Q. 

Theorem 2. Assume that all loss functions {4>i} are convex and satisfy (§. If we run Algorithm^ 
with parameter 0 satisfying the inequality 


6 < min 

i 


PinX 

kvi + nA’ 


( 10 ) 


then the for t > 0 the potential decays exponentially to zero as 


E 





( 11 ) 


Moreover, if we set 9 equal to the upper bound in pO]), then 


T > max 
i 



kvi 

XpiU 


log 


^ (L + A)eW ^ 


E[P{w^^'>) - P{w*)] < e 


The rate, 9, precisely matches that of the QUARTZ algorithm [9]. Quartz is the only other 
method for ERM which has been analyzed for an arbitrary mini-batching scheme. Our algorithm 
is dual-free, and as we have seen above, allows for an analysis covering the case of non-convex loss 
functions. 


4 Chunking 

In this section we illustrate one use of the ability of our method to work with an arbitrary mini¬ 
batching scheme. Further examples include the ability to design distributed variants of the method 
|20j . or the use of importance/adaptive sampling to lower the number of iterations |21l [T^ I14j. 

One marked disadvantage of standard mini-batching (“choose a subset of examples, uniformly 
at random”) used in the context of parallel processing on multicore processors is the fact that in a 
synchronous implementation there is a loss of efficiency due to the fact that the computation time 
of V(I>{AJw) may differ through i. This is caused by the data examples having varying degree of 
sparsity. We hence introduce a new sampling which mitigates this issue. 

Chunks: Choose sets Gi,... ,Gk C [n], such that = [n] and Gi C\Gj = 0 Vi,j and 

'0(i) := nnz(j4j) is similar for every i, i.e. '0(1) ~ ~ '0(^)- Instead of sampling r 

coordinates we propose a new sampling, which on each iteration t samples r sets ,..., out 

of Gi,... ,Gk and uses coordinates i G as the sampled set. We assign each core one of 

the sets for parallel computation. The advantage of this sampling lies in the fact, that the 

load of computing V(I>{AJw) for all i G Gj is similar for all j G [k]. Hence, using this sampling we 
minimize the waiting time of processors. 
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Algorithm 2 Naive Chunks 
Parameters: vector of nnz u 

Initialization n = length(u); Empty vector g and s of length n] m = max(tt) 
g[l] = l, s[l]=u[l], i = l 
for t = 2 : n do 

if 9[i\ + u[t] < m then 

g[i] = g[i] + 1 , s[{\ = s[i] + u[t] 
else 

i = i + 1, g[i] = 1, s[i] = u[t] 


How to choose Gi,, Gk' We introduce the following algorithm: 

The algorithms returns the partition of [n] into Gi,..., Gk in a sense, that the first g[l\ coor¬ 
dinates belong to Gi, next 5 [2] coordinates belong to G 2 and so on. The main advantage of this 
approach is, that it makes a preprocessing step on the dataset which takes jnst one pass throngh 
the data. On Figure [T^ throngh Figure [Tf] we show the impact of Algorithm on the probability 
of the waiting time of a single core, which we measure by the difference 


max{nnz(Aj)} 

i&St 


^ ^ nnz(Ai) 

i&St 


and 


max{nnz(G[*j)} 

ie[T] w 


1 

r 


T 


J]nnz(Gg) 


for the initial and preprocessed dataset respectively. We can observe, that the waiting time is 
smaller nsing the preprocessing. 


5 Experiments 

In all our experiments we used logistic regression. We normalized the datasets so that maxj ||Aj 
1, and fixed A = 1/n. The datasets used for experiments are summarized in Table 


Dataset 

^samples 

^features 

sparsity 

w 8 a 

49,749 

300 

3.8% 

dorothea 

800 

100,000 

0.9% 

protein 

17,766 

358 

29% 

rcvl 

20,242 

47,237 

0 . 2 % 

covl 

581,012 

54 

22 % 


Table 1: Datasets used in the experiments. 


Experiment 1. In Figure we compared the performance of Algorithm with uniform serial 
sampling against state of the art algorithms such as SGD m, sAGia and S2GD [5] in number 
of epochs. The real running time of the algorithms was 0.46s for S2GD, 0.79s for SAG, 0.47s for 
SDCA and 0.58s for SGD. In Fignre we show the convergence rate for different regnlarization 
parameters A. In Fignre we show convergence rates for different serial samplings: nniform. 
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(a) w8a initially 



(d) w8a chunked 



0 1000 2000 3000 4000 5000 6000 

Max(nnz) - Mean(nnz) 

(b) dorothea initially 



Max(nnz) - Mean(nnz) 


(c) protein initially 




(e) dorothea chunked (f) protein chunked 


Figure 1; Distribution of the difference between the maximum number of nonzeros processed by a single 
core and the mean of all nonzeros processed by each core. This difference shows us, how much time is wasted 
per core waiting on the slowest core to finish its task, therefore smaller numbers are better. The first row 
corresponds to the initial distribution while the second row shows the distribution after using Algorithm 


importance [12] and also 4 different randomly generated serial samplings. These samplings were 
generated in a controlled manner, such that random c has (maxj pi)/(mini pi) < c. All of these 
samplings have linear convergence as shown in the theory. 

Experiment 2: 


we 


New sampling vs. old sampling. In Figure 3a through Figure 
compare the performance of a standard parallel sampling against sampling of blocks Gi,... ,Gk 
output by Algorithm]^ In each iteration we measure the time by 


max{nnz(Aj)| 

i&St 


and 

max{nnz(G(i))} 

ie[T] 

for the standard and new sampling respectively. This way we measure only the computations 
done by the core which is going to finish the last in each iteration, and consider the number of 
multiplications with nonzero entries of the data matrix as a proxy for time. 
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Figure 2: LEFT: Comparison of SDCA with other state of the art methods. MIDDLE: SDCA for various values of 
A. RIGHT: SDCA run with various samplings S. 



(a) w8a with t = 5 





(b) w8a with t = 10 (c) w8a with r = 20 


(d) w8a with r = 50 






(e) dorothea with r = 5 


(f) dorothea with r 


10 (g) dorothea with r = 20 (h) dorothea with r = 50 






(i) protein with r = 5 (j) protein with r = 10 (k) protein with r = 20 (1) protein with r = 50 


Figure 3: Logistic regression with A = 1/n. Comparison between new and standard sampling with fine-tuned 
stepsizes for different values of r. 




























































































































6 Proofs 


As a first approximation, our proof is an extension of the proof of Shalev-Shwartz [18] to accom¬ 
modate an arbitrary sampling dSl 11710 IS]. For all i and t we let ui = —'V4>i{Ajw^^'>) and 
zf = af — uf We will use the following lemma. 


Lemma 3 (Evolution of and For a fixed iteration tand all i we have: 


E; 


E, 




= 9 


\af - a*f - \\uf - a*f + (1 - 6p^ ^)\\z.^ 


-iMuh-i)|| 2 ' 


29 


02 


- w*)'^VP{w^^-^^) - ^-2 


'i llJt-i)ll2 


X'''^ / • - v- / Pi 

2 = 1 

Proof. It follows that for i ^ St using the definition ([^ we have 


— z. 


( 12 ) 

(13) 






= f - 11(1 - - aD + ^p7\ 

= - a*f - (1 - 9p-^)\\af~^^ - a*f - - a*f 

+9p-^{l - 9pY\\aY'> - 

= ^Pi^ f f + (1 - ^Pi^)\\^: 


and for i ^ St we have cf = 0. Taking the expectation over St we get the result. 

For the second potential we get 


Bit-1) _ sit) 0 ||^h-l)_^*||2_||^W_^*||2 

20 
nX 






Y.P^ 


-1 /i.^h-Tip 

II • 


i&St i&St 

Taking the expectation over St, using inequality and noting that 


1 


YAizY^^ = + = VP(rc(*-^)), 


1 


n 


(14) 


2=1 


2=1 


we get 


E 


20 


Bit-P - = — y («;(*-!) - W*yAiZ. 

nX ^ 


2=1 2=1 


T zi..h-i) 
2 


>E 


2 = 1 
n 


77,2 A 2 
02 " 


E 




L ieSt 


-1 h-i )||2 
2 ^2 II 


'^^*11 h-i)ii2 


0 _ „-)TvP(«,(‘-‘)) - 4^ V 

A n 2 A 2 ^ Pi 

2=1 




□ 
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6.1 Proof of Theorem (nonconvex case) 

Combining (12) and (13), we obtain 


- a*f + {1 - epr^)\\z, 




2=1 ^ 


A 

+ 2 




i=i 


e 

2n 


E (Cf- - Ihf- - allP) + 


-T-(c 

2 ^ k ^ 


(i-1) II (t-1) 

— \\u- — a 


■f) + - w*)'^VP{w^^-^'^). 


Using ([^ we have 

By strong convexity of P, 


w 


* 112 


- w*)^VP{w^^-^^) > P{w^^-^^) - P{w*) + - 

and P{w^^~^'>) — P{w*) > — tc*|p, which together yields 


112 


W 


- w*)'^VP{w^^-^'>) > - 


.* 112 


W 


Therefore, 


E[T)h-i) - T)W] > 0 


1 A A 


;E^d‘-'T(-^A)B<-) 


n ^ 2Lj 
2=1 ^ 


A 


= 0D^*-^\ 


* 112 


It follows that E[iA(*)] < (1 — 6)D^^ and repeating this recursively we end up with E[iA(* < 
(1 — Oyu^^'^ < This concludes the proof of the first part of Theorem[^ The second part of 

the proof follows by observing that P is (L+A)-smooth, which gives P{w) — P{w*) < \\w — w 

6.2 Convex case 

For the next theorem we need an additional lemma: 

Lemma 4. Assume that 4>i are Li-smooth and convex. Then, for every w, 

1 " 1 


n 


kl^Uw) - < 2 (p{w) - P{w*) - ^\\w - (15) 

i=l * V / 
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Proof. Let gi{x) = (j)i{x) — (l)i{AJw*) — Vw*)^{x — AJw*). Clearly, gi is also /j-smooth. 
By convexity of (pi we have gi{x) > 0 for all x. It follows that gt satisfies ||V( 7 j(x)|p < 2ligi{x). 
Using the definition of gi, we obtain 


WVUAlw) - = \\Vg^{A]w)\\^ 

< 2k[UAlw) - UA]w*) - VUA1 w*)^{A]w - A]w*)]. 

Summing these terms up weighted by l/U and using ([^ we get 


(16) 


1 1 

'^-\\S/(t>i{A]w)-\/(l)i{Alw 


n^lt 

2=1 


JL 2 

*M|2 T ^ -[(l>i{Ajw) - (t>i{A]w*) - AiVpi{A]w*y {w - u;*)] 

1 ^ 

2=1 


0 


y 2 

= 2 


P{w) — ^||tc||2 — p{w*) + ^||rc*||2 + {w — w*) 
P{w)-P{w*)-^\\w-w*f . □ 


6.3 Proof of Theorem!^ 


Combining (12) and (13), we obtain 


e A 1 


\\af + {1 - epr^)\\zf 

IL ^^2 

2=1 


A 

+ 2 




02 


n 


E 


2=1 *- 


5j-(cE''-||«E‘'-a-|P) 


+ - w*yvp{w^^-^'>) 


Using the convexity of P we have P{w*) — P{w^^ > (y;(i 1) _ w*)'^'VP{w^^ and using 

Lemma 1^ we have 

Th V ^ / 

2=1 ^ ^ ^ 

+ 0 (u;(*-^) - w*yVP{w^^-^'>) 


n ^ 2 i * 2 

i=l 


This gives < (1 — 9)E^^ which concludes the first part of the Theorem |2[ The second 

part follows by observing, that P is (L + A)-smooth, which gives P{w) — P{w*) < —— rc*||2. 
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